Abstract

Bayesian networks are basic graphical models, used widely both in statistics and artificial 
intelligence. These statistical models of conditional independence structure are described by 
acyclic directed graphs whose nodes correspond to (random) variables in consideration. A 
quite important topic is the learning of Bayesian network structures, which is determining the 
best fitting statistical model on the basis of given data. Although there are learning methods 
based on statistical conditional independence tests, contemporary methods are mainly based 
on maximization of a suitable quality criterion that evaluates how well the graph explains the
occurrence of the observed data. This leads to a nonlinear combinatorial optimization problem
that is in general NP-hard to solve.

In this paper we deal with the complexity of learning restricted Bayesian network struc- 
tures, that is, we wish to find network structures of highest score within a given subset of 
all possible network structures. For this, we introduce a new unique algebraic representative 
for these structures, called the characteristic imset. We show that these imsets are always 
0-1-vectors and that they have many nice properties that allow us to simplify long proofs for 
some known results and to easily establish new complexity results for learning restricted Bayesian
network structures.



1 Introduction 

Bayesian networks are basic graphical models, used widely both in statistics [13] and artificial
intelligence [18]. These statistical models of conditional independence structure are described by
acyclic directed graphs whose nodes correspond to (random) variables in consideration. 

An important topic is learning Bayesian network structures [17], which means determining the
statistical model on the basis of given data. Although there are learning methods based on statistical
conditional independence tests, contemporary methods are mainly based on maximization of a
suitable quality criterion or score function Q(G,D) of the (acyclic directed) graph G and the
(given = fixed) data D, evaluating how well the graph G explains the occurrence of the observed
data D. This leads to a nonlinear combinatorial optimization problem that is NP-hard [5, 7].
Below we will consider learning restricted Bayesian network structures. Some of these problems
remain NP-hard while others are polynomial-time solvable.

It may happen that two different acyclic directed graphs describe the same statistical model, that
is, they are Markov equivalent. A classic result [9, 25] says that two acyclic directed graphs are
Markov equivalent if and only if they have the same underlying undirected graph and the same
set of immoralities (= special induced subgraphs $a \to c \leftarrow b$ over three nodes $\{a, b, c\}$ with no
arc between $a$ and $b$ in either direction). In order to remove this ambiguity of Markov equivalent
models/graphs, one is interested in having a unique representative for each Bayesian network
structure (= statistical model). A classic unique graphical representative is the essential graph [1]
of the corresponding Markov equivalence class of acyclic directed graphs, which is a special graph
allowing both directed and undirected edges (see Section 2.2 for more details).

Any reasonable score function should be score-equivalent [5], that is, Q(G,D) = Q(H,D) for
any two Markov equivalent graphs G and H. Another standard technical requirement is that the
criterion has to be (additively) decomposable into contributions from the parent sets $\mathrm{pa}_G(i)$ of each
node $i$ [6] (see Section 2.2 for more details).

In this paper, we deal with learning (restricted) decomposable models [13], interpreted as Bayesian
network structures. Decomposable models are exactly those models whose essential graph is an
undirected (and thus also necessarily chordal) graph. That is, decomposable models correspond to
graphical models without immoralities. As input to our learning problem we assume that we are
given an undirected graph K and an evaluation oracle for the score function $Q(\cdot, D)$. Note that we
do not assume the actual data D being part of the input itself. Of course, the evaluation oracle uses 
the given data D in order to evaluate score function values. However, in our treatment, we remove 
the complexity of evaluating score function values from the overall complexity. In particular, this 
means that the (large or small) number of data vectors in D will be irrelevant for our complexity 
results. 

We show that learning spanning trees of K and learning forests in K are both polynomial-time
solvable. For learning spanning trees of K, this observation has already been made in [8] for specific
score functions. Moreover, we show that if we impose degree bounds $\deg(v) \le k$ on all nodes $v \in N$
for some constant $k \ge 2$, then both problems become NP-hard. We also show that learning chordal
subgraphs of K is NP-hard. This result, however, has already been shown even for specific score
functions and also for the case of fixed bounded size of possible cliques [20]. We include our short
proof to emphasize the simplicity and usefulness of our approach, which also recovers hardness
results easily.

We will rewrite the nonlinear combinatorial optimization problem behind the learning problem 
into a linear integer optimization problem (in higher dimension) by using an algebraic approach to 
the description of conditional independence structures [21] that represents them by certain vectors
with integer components, called imsets (short for "Integer Multi-SETs"). In the context of learning 
Bayesian networks this led to the proposal to represent each Bayesian network structure uniquely 
by a so-called standard imset. The advantage of this algebraic approach is that every reasonable 
score function (score equivalent and decomposable), becomes an affine function of the standard 
imset (see Chapter 8 in [21]). Moreover, it has recently been shown in [23] that the standard 
imsets over a fixed set of variables are exactly the vertices of their convex hull, the standard imset 
polytope. These results allow one to apply the methods of polyhedral geometry in the area of 
learning Bayesian networks, because they transform this task to a linear programming problem. 

Instead of considering standard imsets, we introduce a different unique representative that is obtained
from the standard imset by an invertible affine linear map that preserves lattice points in
both directions. We call these new representatives characteristic imsets, as they are 0-1-vectors
and as they also contain, for each acyclic directed graph, the characteristic vector of the underlying
undirected graph. Although, mathematically, this map is simply a change in coordinates, the
characteristic imset is much closer to the graphical description because it allows one to identify
immediately both the underlying undirected graph and the immoralities. Our procedure for recovering
the essential graph from the characteristic imset is much simpler than the reconstruction
from the standard imset as presented in [22].

Moreover, due to the affine transformation, every reasonable score function is also an affine function 
of the characteristic imset. Thus, learning Bayesian network structures can be reduced to solving a 
linear optimization problem over a certain 0-1-polytope. Unfortunately, a complete facet description 
for this polytope (for general $|N|$) is still unknown. A conjectured list of all facets for the standard
imset polytope (and consequently also for the characteristic imset polytope) is presented in [24].
A complete facet description is also unknown for the convex hull of all characteristic imsets of
undirected chordal graphs, although the characteristic imsets themselves are well-understood in
this case (see Section 

To summarize, we offer a new method for analyzing the learning procedure through an algebraic 
way of representing statistical models. We believe that our approach via characteristic imsets brings 
a tremendous mathematical simplification that allows us to easily recover known results and to 
establish new complexity results. We also think that a better understanding of the polyhedral 
properties of the characteristic imset polytope (complete facet description or all edge directions) 
will lead to future applications of efficient (integer) linear programming methods and software in 
this area of learning Bayesian network structures. 



2 Basic concepts 

We tacitly assume that the reader is familiar with basic concepts from polyhedral geometry. We 
only recall briefly the definitions of concepts mentioned above, but skip their statistical motivation. 

Throughout the paper N is a finite non-empty set of variables; to avoid the trivial case we assume
$|N| \ge 2$. In a statistical context, the elements of N correspond to random variables in consideration;
in graphical context, they correspond to nodes. 

2.1 Graphical concepts 

Graphs considered here have a finite non-empty set of nodes N and two types of edges: directed 
edges, called arcs (or arrows in machine learning literature), denoted by $i \to j$ (or $j \leftarrow i$), and
undirected edges. No loops or multiple edges between two nodes are allowed. 

A set of nodes $C \subseteq N$ is a clique (or a complete set) in G if every pair of distinct nodes in $C$ is
connected by an undirected edge. An immorality in a graph G is an induced subgraph (of G) for
three nodes $\{a, b, c\}$ in which $a \to c \leftarrow b$ and $a$ and $b$ are not adjacent. An undirected graph is called
chordal if every (undirected) cycle of length at least 4 has a chord, that is, an edge connecting
two non-consecutive nodes in the cycle. A forest is an undirected graph without undirected cycles.
A connected forest over N is called a spanning tree. By the degree $\deg_G(i)$ of a node $i \in N$ (in an
undirected graph G), we mean the number of edges incident to $i$ in G.

Note that an undirected graph is chordal if and only if all its edges can be directed in such a way 
that the result is an acyclic directed graph without immoralities (see §2.1 in [13]).

Occasionally, we will use the (in the machine learning community) commonly used acronym "DAG" 
for "directed acyclic graph" , although the grammatically correct phrase is "acyclic directed graph" . 

2.2 Learning Bayesian network structures 

In a statistical context, each variable (= node) $i \in N$ is assigned a finite (individual) sample
space $X_i$ (= the set of possible values); to avoid technical problems assume $|X_i| \ge 2$ for each
$i \in N$. A Bayesian network structure defined by a DAG G (over N) is formally the class of discrete
probability distributions P on the joint sample space $\prod_{i \in N} X_i$ that are Markovian with respect to
G. Note that P is Markovian with respect to G if it satisfies conditional independence restrictions
determined by the respective separation criterion (see [13, 18]).

Different DAGs over N can be Markov equivalent, which means they define the same Bayesian
network structure. The classic graphical characterization of (Markov) equivalent graphs is this:
they are equivalent if and only if they have the same underlying undirected graph and the same
immoralities (see [1]). The classic unique graphical representative of a Bayesian network structure
is the essential graph $G^*$ of the respective (Markov) equivalence class $\mathcal{G}$ of acyclic directed graphs:
one has $a \to b$ in $G^*$ if this arc occurs in every graph from $\mathcal{G}$, and $G^*$ has an undirected edge between
$a$ and $b$ if one has $a \to b$ in one graph and $b \to a$ in another graph (from $\mathcal{G}$). A less informative
(unique) representative is the pattern pat(G) (of any G in $\mathcal{G}$): it is obtained from the underlying
graph of G by directing (only) those edges that belong to immoralities (in G).
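
The graphical characterization of Markov equivalence can be checked mechanically. The following is a minimal sketch, assuming a DAG is encoded as a dictionary that maps each node to the set of its parents; this encoding and the function names are our illustrative choices and do not come from the cited literature.

```python
from itertools import combinations

def skeleton(parents):
    """Underlying undirected graph as a set of frozenset({a, b}) edges."""
    return {frozenset((p, child)) for child, ps in parents.items() for p in ps}

def immoralities(parents):
    """All triples (a, c, b), a < b, with a -> c <- b and a, b non-adjacent."""
    edges = skeleton(parents)
    return {(a, c, b)
            for c, ps in parents.items()
            for a, b in combinations(sorted(ps), 2)
            if frozenset((a, b)) not in edges}

def markov_equivalent(g, h):
    """Same underlying undirected graph and same set of immoralities."""
    return skeleton(g) == skeleton(h) and immoralities(g) == immoralities(h)

chain = {"a": set(), "b": {"a"}, "c": {"b"}}            # a -> b -> c
reversed_chain = {"a": {"b"}, "b": {"c"}, "c": set()}   # a <- b <- c
collider = {"a": set(), "b": {"a", "c"}, "c": set()}    # a -> b <- c

print(markov_equivalent(chain, reversed_chain))  # True
print(markov_equivalent(chain, collider))        # False
```

The two chains share the skeleton and have no immorality, whereas the collider has the same skeleton but contains the immorality at b.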

Learning a Bayesian network structure means determining it on the basis of an observed (complete)
database D (of length $\ell \ge 1$), which is a sequence $x_1, \ldots, x_\ell$ of elements of the joint sample space.
D is called complete if all components of the elements $x_1, \ldots, x_\ell$ are known. A quality criterion is
a real function Q of two variables: of an acyclic directed graph G and of a database D. A learning
procedure consists in maximizing the function $G \mapsto Q(G, D)$ for given fixed D. Since the aim
is to learn a Bayesian network structure, the criterion should be score equivalent, which means
Q(G, D) = Q(H, D) for any pair of Markov equivalent graphs G, H and for any database D. A
standard technical requirement [6] is that the criterion has to be (additively) decomposable, which
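means that it can be written as follows:

$$Q(G, D) = \sum_{i \in N} q_{i \mid \mathrm{pa}_G(i)}\bigl(D_{\{i\} \cup \mathrm{pa}_G(i)}\bigr),$$

where $D_A$ denotes the projection of the database D to $\prod_{i \in A} X_i$ (for $\emptyset \ne A \subseteq N$) and $q_{i \mid B}$ for
$i \in N$, $B \subseteq N \setminus \{i\}$ are real functions.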

Finally, let us remark that the essential graph G* of a DAG G is an undirected graph if and only if 
G has no immoralities. Consequently, every cycle in the undirected graph underlying G (which must 
coincide with G*) of length at least 4 must contain a chord (otherwise there exists an immorality on 
this cycle in G). Therefore, if an essential graph is undirected it has to be chordal. Conversely, if G* 
is chordal, G cannot have an immorality. Therefore, learning decomposable models can be viewed
as learning (special) Bayesian network structures corresponding to chordal undirected essential 
graphs P]. 

2.3 Algebraic approach to learning 

An imset over N is a vector in $\mathbb{Z}^{2^{|N|}}$, whose components are indexed by subsets of N. Traditionally,
all subsets of N are considered, although in Section 3 we also consider imsets with a restricted domain
(components corresponding to the empty set and to singletons are dropped, since they linearly
depend on the other components). Every vector in $\mathbb{R}^{2^{|N|}}$ can be written as a (real) combination of
basic vectors $\delta_A \in \{0, 1\}^{2^{|N|}}$:

$$\delta_A(T) = \begin{cases} 1 & \text{if } T = A, \\ 0 & \text{if } T \ne A, \end{cases} \qquad \text{for } T \subseteq N \text{ (if } A \subseteq N \text{ is fixed).}$$

This allows us to give formulas for imsets. Given an acyclic directed graph G over N, the standard 
imset for G is given by 

$$u_G = \delta_N - \delta_\emptyset + \sum_{i \in N} \bigl\{ \delta_{\mathrm{pa}_G(i)} - \delta_{\{i\} \cup \mathrm{pa}_G(i)} \bigr\}, \qquad (2.1)$$

where the basic vectors can cancel each other. It is a unique algebraic representative of the corresponding
Bayesian network structure because $u_G = u_H$ if and only if G and H are Markov
equivalent (Corollary 7.1 in [21]). The convex hull of the set of all standard imsets over N is the
standard imset polytope.
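
As an illustration of formula (2.1), the standard imset can be assembled term by term. This is a minimal sketch, assuming the same hypothetical encoding of a DAG as a dictionary from nodes to parent sets; the sparse dictionary representation (zero entries omitted) is our choice.

```python
from collections import Counter

def standard_imset(parents):
    """Standard imset u_G as a sparse map {frozenset: nonzero integer}."""
    u = Counter()
    u[frozenset(parents)] += 1        # + delta_N
    u[frozenset()] -= 1               # - delta_emptyset
    for i, pa in parents.items():
        u[frozenset(pa)] += 1         # + delta_{pa_G(i)}
        u[frozenset(pa) | {i}] -= 1   # - delta_{{i} u pa_G(i)}
    # Basic vectors may cancel each other; keep only nonzero entries.
    return {s: v for s, v in u.items() if v != 0}

chain = {"a": set(), "b": {"a"}, "c": {"b"}}            # a -> b -> c
reversed_chain = {"a": {"b"}, "b": {"c"}, "c": set()}   # a <- b <- c
collider = {"a": set(), "b": set(), "c": {"a", "b"}}    # a -> c <- b

print(standard_imset(chain) == standard_imset(reversed_chain))  # True
```

The two Markov equivalent chains yield the same vector, in line with the uniqueness statement above, while the collider yields a different one.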

An important result from the point of view of an algebraic approach to learning Bayesian network 
structures is that any score equivalent and decomposable quality criterion (= score function) Q 
has the form 

$$Q(G, D) = s^Q_D - \langle t^Q_D, u_G \rangle, \qquad (2.2)$$

where $\langle \cdot, \cdot \rangle$ denotes the scalar product, and both $s^Q_D \in \mathbb{R}$ and $t^Q_D \in \mathbb{R}^{2^{|N|}}$ only depend on the
database D and the chosen quality criterion (see Lemmas 8.3 and 8.7 in [21]). In particular, the 
task to maximize Q is equivalent to finding the optimum of a linear function over the standard 
imset polytope. 

3 Characteristic imsets 

In this section we introduce the notion of a characteristic imset and prove some useful facts about 
it. For example, we show that this imset is always a 0-1 vector. 

Definition 3.1 Given an acyclic directed graph G over N, let $u_G$ be the standard imset for G.
We introduce

$$\mathrm{portrait}[u_G] := \bigl( \mathrm{portrait}[u_G](T) \bigr)_{T \subseteq N,\, |T| > 1} \in \mathbb{Z}^{2^{|N|} - |N| - 1}$$

with

$$\mathrm{portrait}[u_G](T) := \sum_{X \subseteq N:\, T \subseteq X} u_G(X) \qquad \text{for } T \subseteq N,\ |T| > 1,$$

and call $\mathrm{portrait}[u_G]$ the upper portrait of $u_G$ or, simply, of G.
Moreover, we will call

$$c_G := 1 - \mathrm{portrait}[u_G] \in \mathbb{Z}^{2^{|N|} - |N| - 1}$$

the characteristic imset of G.

Characteristic imsets are unique representatives of Markov equivalence classes. This is because the
standard imsets are unique representatives and because the upper portrait map is an affine linear
map that is invertible. The inverse map is given by the well-known Möbius inversion formula [3].
In fact, both maps assign lattice points to lattice points.
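
To make the definition concrete, consider the following small worked example, which is our illustration and not part of the original development. Take $N = \{a, b, c\}$ and the DAG $G$ given by $a \to c \leftarrow b$, so that $\mathrm{pa}_G(a) = \mathrm{pa}_G(b) = \emptyset$ and $\mathrm{pa}_G(c) = \{a, b\}$. Formula (2.1) and Definition 3.1 then give:

```latex
\begin{align*}
u_G &= \delta_N - \delta_\emptyset
      + (\delta_\emptyset - \delta_{\{a\}})
      + (\delta_\emptyset - \delta_{\{b\}})
      + (\delta_{\{a,b\}} - \delta_{\{a,b,c\}})
     = \delta_\emptyset - \delta_{\{a\}} - \delta_{\{b\}} + \delta_{\{a,b\}}, \\[4pt]
\mathrm{portrait}[u_G](\{a,b\}) &= u_G(\{a,b\}) + u_G(\{a,b,c\}) = 1 + 0 = 1
      &&\Longrightarrow\quad c_G(\{a,b\}) = 0, \\
\mathrm{portrait}[u_G](\{a,c\}) &= u_G(\{a,c\}) + u_G(\{a,b,c\}) = 0 + 0 = 0
      &&\Longrightarrow\quad c_G(\{a,c\}) = 1, \\
\mathrm{portrait}[u_G](\{b,c\}) &= u_G(\{b,c\}) + u_G(\{a,b,c\}) = 0 + 0 = 0
      &&\Longrightarrow\quad c_G(\{b,c\}) = 1, \\
\mathrm{portrait}[u_G](\{a,b,c\}) &= u_G(\{a,b,c\}) = 0
      &&\Longrightarrow\quad c_G(\{a,b,c\}) = 1.
\end{align*}
```

The two entries equal to 1 on two-element sets are exactly the edges of the underlying undirected graph, and the pattern $c_G(\{a,b,c\}) = 1$, $c_G(\{a,b\}) = 0$ reflects the immorality at $c$.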

Characteristic imsets have remarkable properties and, as we will show below, their entries directly 
encode the underlying undirected graph and the immoralities of the given acyclic directed graph. 

Theorem 3.2 Let G be an acyclic directed graph over N. For any $T \subseteq N$, $|T| > 1$, we have
$c_G(T) \in \{0, 1\}$, and $c_G(T) = 1$ iff there exists some $i \in T$ with $T \setminus \{i\} \subseteq \mathrm{pa}_G(i)$. In particular,

$$c_G \in \{0, 1\}^{2^{|N|} - |N| - 1}.$$

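Proof. Consider the defining formula (2.1) for the standard imset. For any $T \subseteq N$, $|T| > 1$, the
value $\mathrm{portrait}[u_G](T)$ can be computed as

$$\mathrm{portrait}[u_G](T) = \sum_{X \subseteq N:\, T \subseteq X} u_G(X) = 1 + \sum_{i \in N:\, T \subseteq \mathrm{pa}_G(i)} 1 \;-\; \sum_{i \in N:\, T \subseteq \mathrm{pa}_G(i) \cup \{i\}} 1.$$

Hence, we get

$$\begin{aligned}
c_G(T) &= 1 - \mathrm{portrait}[u_G](T) \\
&= \sum_{i \in N:\, T \subseteq \mathrm{pa}_G(i) \cup \{i\}} 1 \;-\; \sum_{i \in N:\, T \subseteq \mathrm{pa}_G(i)} 1 \\
&= \sum_{i \in N:\, T \subseteq \mathrm{pa}_G(i) \cup \{i\},\ i \in T} 1 \\
&= \sum_{i \in T:\, T \setminus \{i\} \subseteq \mathrm{pa}_G(i)} 1.
\end{aligned}$$

For fixed T, assume that there are two different elements $i, j \in T$ with $T \setminus \{i\} \subseteq \mathrm{pa}_G(i)$ and
$T \setminus \{j\} \subseteq \mathrm{pa}_G(j)$. This implies both $i \in \mathrm{pa}_G(j)$ and $j \in \mathrm{pa}_G(i)$. The simultaneous existence of
the arcs $i \to j$ and $j \to i$, however, contradicts the assumption that G is acyclic. Thus, for each
$T \subseteq N$, there is at most one $i \in T$ with $T \setminus \{i\} \subseteq \mathrm{pa}_G(i)$. Consequently,

$$c_G(T) = \sum_{i \in T:\, T \setminus \{i\} \subseteq \mathrm{pa}_G(i)} 1 \;\in\; \{0, 1\},$$

and thus $c_G \in \{0, 1\}^{2^{|N|} - |N| - 1}$. □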
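
The two descriptions of the characteristic imset, via the upper portrait (Definition 3.1) and via parent sets (Theorem 3.2), can be compared numerically on small graphs. The sketch below assumes a DAG encoded as a dictionary from nodes to parent sets; all names are illustrative, and the enumeration over all subsets is exponential in $|N|$, so it is meant for small examples only.

```python
from itertools import combinations

def sets_of_size_at_least_two(nodes):
    """All subsets T of the node set with |T| > 1, as frozensets."""
    nodes = sorted(nodes)
    for r in range(2, len(nodes) + 1):
        for t in combinations(nodes, r):
            yield frozenset(t)

def char_imset_via_portrait(parents):
    """c_G(T) = 1 - sum of u_G(X) over all X containing T."""
    u = {}
    def add(s, v):
        u[s] = u.get(s, 0) + v
    add(frozenset(parents), 1)       # delta_N
    add(frozenset(), -1)             # -delta_emptyset
    for i, pa in parents.items():
        add(frozenset(pa), 1)
        add(frozenset(pa) | {i}, -1)
    return {t: 1 - sum(v for x, v in u.items() if t <= x)
            for t in sets_of_size_at_least_two(parents)}

def char_imset_direct(parents):
    """c_G(T) = 1 iff some i in T satisfies T minus {i} inside pa_G(i)."""
    return {t: int(any(t - {i} <= set(parents[i]) for i in t))
            for t in sets_of_size_at_least_two(parents)}

# a -> c <- b, c -> d, a -> d
dag = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"a", "c"}}
print(char_imset_via_portrait(dag) == char_imset_direct(dag))  # True
```

The direct formula needs no cancellation of basic vectors, which is the practical advantage of Theorem 3.2.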

Corollary 3.3 For any N, the only lattice points in the standard imset polytope and in the characteristic
imset polytope are their vertices.

Proof. The statement holds for any 0-1-polytope and thus in particular also for the characteristic
imset polytope. Moreover, the portrait map and its inverse, the Möbius map, are affine linear maps
between $u_G$ and $c_G$ that map lattice points to lattice points. Thus, the result holds also for the
standard imset polytope. □

Remark. Corollary 3.3 (for standard imsets) has already been stated and proved in [24]. The
original proof of this result in the manuscript of [21] was quite long and complicated. Later discussions
among the authors of the present paper led to the much simpler proof using the portrait
map, which was then also used in the final version of [21]. Corollary 3.3 also implies that the set of
standard imsets is exactly the set of all vertices of the standard imset polytope, again simplifying
the lengthy proof from [23]. □

Given a chordal undirected graph G, the corresponding characteristic imset $c_G$ is defined as the
characteristic imset of any DAG $\hat{G}$ Markov equivalent to G. The observation that characteristic
imsets are unique representatives of Markov equivalence classes makes the definition correct.

Corollary 3.4 Let G be an undirected chordal graph over N. Then, for $T \subseteq N$, $|T| > 1$, we have
$c_G(T) = 1$ if and only if T is a clique in G.

Proof. As G is the essential graph of an acyclic directed graph with no immoralities, we can
direct the edges of G in such a way that we obtain an equivalent acyclic directed graph $\hat{G}$ with
no immoralities. To show the forward implication, let $T \subseteq N$ be given with $c_G(T) = 1$. As
$c_{\hat{G}}(T) = c_G(T) = 1$, there exists some $i \in T$ such that $T \setminus \{i\} \subseteq \mathrm{pa}_{\hat{G}}(i)$. Assume now, for a
contradiction, that there are two nodes $j, k \in T \setminus \{i\}$ that are not connected by an edge in G
(and hence j and k are not connected in $\hat{G}$). Then, however, $j \to i \leftarrow k$ is an immorality in $\hat{G}$, a
contradiction. Hence, all nodes in $T \setminus \{i\}$ must be pairwise connected by an edge in G. As they are
all connected in G by an edge to i, T is a clique in G. To show the converse implication, let $T \subseteq N$
be a clique in G. Note that in $\hat{G}$, being an acyclic directed graph, the set T must contain a node
i such that for all $j \in T \setminus \{i\}$ the edge $\{i, j\}$ of G is directed towards i in $\hat{G}$. But then $T \setminus \{i\} \subseteq \mathrm{pa}_{\hat{G}}(i)$
and therefore $c_G(T) = 1$ by Theorem 3.2. □

Applying this observation to special undirected chordal graphs, namely to undirected forests, we 
obtain the following characterization. 

Corollary 3.5 Let G be an undirected forest having N as the set of nodes. Then, for $T \subseteq N$,
$|T| > 1$, we have $c_G(T) = 1$ if and only if T is an edge of G, or, in other words,

$$c_G = \begin{pmatrix} \chi(G) \\ 0 \end{pmatrix},$$

where $\chi(G)$ denotes the characteristic vector of the edge-set of G.

Indeed, the only cliques of cardinality at least two in a forest are its edges. A similar result, in fact, 
holds for any acyclic directed graph G. 

Corollary 3.6 Let G be a DAG over N and $\bar{G}$ its underlying undirected graph. Then for any
two-element subset $\{a, b\} \subseteq N$, we have $c_G(\{a, b\}) = 1$ if and only if $a \to b$ or $b \to a$ is an edge of
G, or, in other words,

$$c_G = \begin{pmatrix} \chi(\bar{G}) \\ * \end{pmatrix},$$

where $*$ denotes the remaining components of $c_G$.

Proof. This is an easy consequence of Theorem 3.2. If $c_G(T) = 1$ for $T = \{a, b\}$, then the only
$i \in T$ with $T \setminus \{i\} \subseteq \mathrm{pa}_G(i)$ is either a or b. □

Thus, $c_G$ is an extension of the characteristic vector $\chi(\bar{G})$ of the edge-set of $\bar{G}$, which motivated
our terminology. Let us now show how to convert $c_G$ back to the pattern graph pat(G) of G.

Theorem 3.7 Let G be an acyclic directed graph over N and let $a, b \in N$ be distinct nodes. Then
the following holds:

(1) $a$ and $b$ are connected in G iff $c_G(\{a, b\}) = 1$; otherwise $c_G(\{a, b\}) = 0$.

(2) $a \to b$ belongs to an immorality in G iff there exists some $i \in N \setminus \{a, b\}$ with $c_G(\{a, b, i\}) = 1$
and $c_G(\{a, i\}) = 0$. The latter condition implies $c_G(\{a, b\}) = 1$ and $c_G(\{b, i\}) = 1$.

Proof. The condition (1) follows from Corollary 3.6 and Theorem 3.2.

For (2) assume that $a \to b \leftarrow i$ is an immorality in G. Then $c_G(\{a, b, i\}) = 1$ by Theorem 3.2 and
the necessity of the other conditions follows from (1). Conversely, provided that $c_G(\{a, b, i\}) = 1$,
one of the three options $a \to i \leftarrow b$, $i \to a \leftarrow b$ and $a \to b \leftarrow i$ (with possible additional edges)
occurs. Now, $c_G(\{a, i\}) = 0$ implies that a and i are not adjacent in G, which excludes the first
two options and implies that $a \to b \leftarrow i$ is an immorality. □
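
Conditions (1) and (2) translate directly into a procedure that reads the pattern off the entries of $c_G$ for sets of cardinality 2 and 3. The following is a minimal sketch, assuming the characteristic imset is given as a dictionary from frozensets to 0/1 with missing keys read as 0; this representation and the function names are our assumptions.

```python
from itertools import combinations

def pattern_from_char_imset(nodes, c):
    """Return (edges, arrows): the skeleton and the arcs lying in immoralities."""
    def val(*xs):
        return c.get(frozenset(xs), 0)

    # Condition (1): a and b are adjacent iff c({a, b}) = 1.
    edges = {frozenset(p) for p in combinations(nodes, 2) if val(*p) == 1}

    arrows = set()
    for e in edges:
        x, y = tuple(e)
        for a, b in ((x, y), (y, x)):
            # Condition (2): a -> b lies in an immorality iff some third node i
            # has c({a, b, i}) = 1 and c({a, i}) = 0.
            if any(val(a, b, i) == 1 and val(a, i) == 0
                   for i in nodes if i not in (a, b)):
                arrows.add((a, b))
    return edges, arrows

# Characteristic imset of the collider a -> c <- b.
c_collider = {
    frozenset("ab"): 0, frozenset("ac"): 1,
    frozenset("bc"): 1, frozenset("abc"): 1,
}
edges, arrows = pattern_from_char_imset("abc", c_collider)
print(sorted(arrows))  # [('a', 'c'), ('b', 'c')]
```

Edges of the skeleton that do not appear in `arrows` stay undirected in the pattern.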

Corollary 3.8 Let G be a DAG over N. The characteristic imset $c_G$ is determined uniquely by
its values for sets of cardinality 2 and 3.

Proof. By Theorem 3.7, these values determine both the edges and immoralities in G. In particular,
they determine the pattern pat(G). As explained in Section 2.2, this uniquely determines the
Bayesian network structure and, therefore, the respective standard and characteristic imsets. □

More specifically, the components of $c_G$ for $|S| \ge 4$ can be derived iteratively from the components
for $|S| \le 3$ on the basis of the following lemma. A further simple consequence of the lemma below
is that the entries for $|S| \ge 4$ are not linear functions of the entries for $|S| \le 3$.

Lemma 3.9 Let G be a DAG over N, and $S \subseteq N$, $|S| \ge 4$. Then the following conditions are
equivalent.

(a) $c_G(S) = 1$,

(b) there exist $|S| - 1$ subsets T of S with $|T| = |S| - 1$ and $c_G(T) = 1$,

(c) there exist three subsets T of S with $|T| = |S| - 1$ and $c_G(T) = 1$.

Figure 1: Orientation rules for getting the essential graph.

In the proof, by a terminal node within a set $T \subseteq N$ we mean $i \in T$ such that there is no $j \in T \setminus \{i\}$
with $i \to j$ in G.

Proof. The implication (a) $\Rightarrow$ (b) simply follows from Theorem 3.2; (b) $\Rightarrow$ (c) is trivial. To show
(c) $\Rightarrow$ (a) we first fix a terminal node i within S. Now, (c) implies there exist at least two sets
$T \subseteq S$, $|T| = |S| - 1$, with $c_G(T) = 1$ which contain i. Let T be one of them. Since $c_G(T) = 1$, by Theorem 3.2
there exists $k \in T$ with $j \to k$ for every $j \in T \setminus \{k\}$. If $i \ne k$, then $i \to k$, which contradicts i
being terminal in S. Thus, $i = k$. Since those two sets T cover S, one has $j \to i$ for every $j \in S \setminus \{i\}$
and Theorem 3.2 implies $c_G(S) = 1$. □
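
The equivalence of (a) and (c) gives a simple iterative rule for filling in the higher entries. The following sketch assumes the entries for sets of cardinality 2 and 3 are supplied as a dictionary from frozensets to 0/1; the helper `char_imset_direct` (Theorem 3.2 applied to a DAG encoded by parent sets) is included only to produce test input, and all names are illustrative.

```python
from itertools import combinations

def char_imset_direct(parents, sizes):
    """c_G(T) by Theorem 3.2 for all T whose cardinality lies in `sizes`."""
    nodes = sorted(parents)
    return {frozenset(t): int(any(set(t) - {i} <= set(parents[i]) for i in t))
            for r in sizes for t in combinations(nodes, r)}

def extend_char_imset(nodes, c_small):
    """Fill in c_G(S) for |S| >= 4 from the entries with |S| in {2, 3}.

    By Lemma 3.9, c_G(S) = 1 iff at least three subsets T of S with
    |T| = |S| - 1 satisfy c_G(T) = 1.
    """
    c = dict(c_small)
    nodes = sorted(nodes)
    for r in range(4, len(nodes) + 1):          # increasing cardinality
        for s in map(frozenset, combinations(nodes, r)):
            hits = sum(c.get(s - {x}, 0) for x in s)
            c[s] = int(hits >= 3)
    return c

dag = {1: set(), 2: {1}, 3: {1, 2}, 4: {1, 2, 3}, 5: {3}}
small = char_imset_direct(dag, sizes=(2, 3))
full = extend_char_imset(dag, small)
print(full == char_imset_direct(dag, sizes=range(2, 6)))  # True
```

The threshold test `hits >= 3` is not a linear function of the lower entries, which matches the remark preceding the lemma.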

Theorem 3.7 allows us to reconstruct the essential graph for G. Indeed, the conditions (1) and (2)
directly characterize the pattern graph pat(G). However, in general, there could be other arcs in
the essential graph. Fortunately, there is a polynomial graphical algorithm transforming pat(G)
into the corresponding essential graph $G^*$. More specifically, Theorem 3 in [14] says that, provided
pat(G) is the pattern of an acyclic directed graph G, the repeated (exhaustive) application of the
orientation rules from Figure 1 gives the essential graph $G^*$.

Finally, we wish to point out that Theorems 3.2 and 3.7 directly lead to a procedure for testing
whether a given vector $c \in \mathbb{Z}^{2^{|N|} - |N| - 1}$ is a characteristic imset for some acyclic directed graph
G over N. Using both theorems, one first constructs a candidate pattern graph, then a candidate
essential graph, and then from it a candidate acyclic directed graph G. It remains to check whether
the characteristic imset of G coincides with the given vector c.



4 Learning restricted Bayesian network structures 

A lot of research is devoted to complexity results for the general problem of learning Bayesian
network structures, analyzing different optimization strategies, scoring functions
and representations of data. For example, Chickering, Heckerman and Meek show the large-sample
learning problem to be NP-hard even when the distribution is perfectly Markovian [7]. On the
other hand, Chickering [5] shows learning Bayesian network structures to be NP-complete when
using a certain Bayesian score. This remains valid even if the number of parents is limited to a
constant.

Our assumptions



A reduction in complexity could be achieved by limiting the possible structures the Bayesian
network can have. In the following, we will restrict our attention to learning decomposable models,
that is, learning the best DAG among all DAGs whose essential graphs are undirected (and thus
also chordal). In fact, we assume that we are given an undirected graph K over N with an edge-set
$\mathcal{E}(K)$, not necessarily the complete graph, and we wish to learn a DAG G that maximizes the
quality criterion and whose essential graph is an (undirected) subgraph of K of a certain type. In
particular, we are interested in learning undirected forests and spanning trees with and without
degree bounds and in learning undirected chordal graphs.

We wish to point out here that we make minimal assumptions on the database D and on the 
quality criterion to be optimized. We only assume that the database D over N is complete, that
is, no data entry has a missing/unknown component (see Section 2.2). For the quality criterion (=
score function) we require that it is score equivalent and decomposable. In fact, instead of having 
D and an explicit score function available, we only assume that we are given an evaluation oracle 
(depending on D) that, when queried on G, returns the value Q(G, D). Clearly, especially for larger 
databases D, computing a single score function value Q(G, D) may be expensive. By assuming a 
given evaluation oracle, we give constant costs to score function evaluations in our complexity 
results below. 

Finally, we wish to remind the reader that under our assumptions learning the best DAG representing
D becomes the problem of maximizing a certain linear functional (whose components
depend on D) over the characteristic imsets (see Section 2.3). However, as this linear problem is
in (exponential) dimension $2^{|N|} - |N| - 1$, we cannot employ this transformation directly in our
complexity treatment.

4.1 Learning undirected forests and spanning trees 

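By Corollary 3.5 we know that every DAG whose essential graph is an undirected forest G has
the characteristic imset $c_G = (\chi(G), 0)$. As the quality criterion is an affine function of $c_G$, only
the entries indexed by edges of K contribute to the objective, and the learning problem reduces
to finding a maximum weight forest G as a subgraph of K. The same argumentation holds for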
learning undirected spanning trees of K. These are two well-known combinatorial problems that 
can be solved in polynomial time via greedy-type algorithms (see e.g. §40 in [19]). We conclude
the following statement. 

Lemma 4.1 Given a node set N, an undirected graph $K = (N, \mathcal{E}(K))$ and an evaluation oracle
for computing Q(G,D). The problems of finding a maximum score subgraph of K that is

(a) a forest,

(b) a spanning tree,

can be solved in time polynomial in $|N|$.
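
The greedy approach behind Lemma 4.1 can be sketched as follows. Since the score is an affine function of $c_G$ and, for a forest, only the edge entries of $c_G$ are nonzero, the score of a forest equals the score of the empty graph plus a sum of edge weights; under this reading, the weight of an edge $e$ can be obtained from two oracle calls as $Q(G_e, D) - Q(G_\emptyset, D)$, where $G_e$ has the single edge $e$. The oracle signature `score(edges)`, the function name, and the toy weights below are hypothetical, and the routine is a Kruskal-type sketch rather than the procedure of any cited reference.

```python
def max_score_forest(nodes, candidate_edges, score, spanning_tree=False):
    """Greedy (Kruskal-type) search for a best forest among subgraphs of K.

    `score(edges)` is a hypothetical oracle returning Q(G, D) for the
    undirected forest G with the given list of edges.
    """
    base = score([])
    weight = {e: score([e]) - base for e in candidate_edges}

    parent = {v: v for v in nodes}           # union-find structure
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]    # path halving
            v = parent[v]
        return v

    chosen = []
    for a, b in sorted(candidate_edges, key=weight.get, reverse=True):
        if not spanning_tree and weight[(a, b)] <= 0:
            break                            # no remaining edge improves a forest
        ra, rb = find(a), find(b)
        if ra != rb:                         # joins two components, so no cycle
            parent[ra] = rb
            chosen.append((a, b))
    return chosen

# Toy oracle (illustration only): an additive score with fixed edge weights.
w = {("a", "b"): 3.0, ("b", "c"): 2.0, ("a", "c"): 1.0, ("c", "d"): -1.0}
toy_score = lambda edges: 10.0 + sum(w[e] for e in edges)

print(max_score_forest("abcd", list(w), toy_score))
print(max_score_forest("abcd", list(w), toy_score, spanning_tree=True))
```

With `spanning_tree=True` the routine keeps adding edges of non-positive weight; if K is disconnected it returns a spanning forest, and no spanning tree exists in that case. The number of oracle calls is $|\mathcal{E}(K)| + 1$.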

Although K is part of the input, we need not state the complexity dependence with respect
to the encoding length of K explicitly here, since the encoding length $\langle K \rangle$ of K is at least $|N|$.
Moreover, we have $\langle K \rangle \in O(|N|^2)$.

Chow and Liu [8] provided a polynomial time procedure (in $|N|$) for maximizing the maximum
log-likelihood criterion which finds an optimal dependence tree (= a spanning tree). The core of their
algorithm is the greedy algorithm and they apply it to a non-negative objective function. For their 
result, the complexity of computing the probabilities from data (and hence the objective/score 
function) is also omitted. A similar result was obtained by Heckerman, Geiger and Chickering 
for the Bayesian scoring criterion. Our result combines all of these previous results by only 
supposing a decomposable and score equivalent quality criterion. 

We wish to point out here that the well-known GES algorithm [5, 15], which was designed to
learn general Bayesian network structures, could be modified in a straight-forward way to learn 
undirected forests (among the subgraphs of K). Then the first phase of this new GES-type algorithm
coincides with the greedy algorithm to find a maximum weight forest and the second phase
of the algorithm cannot remove any edge. Thus, the modified GES algorithm always finds a best
undirected forest (among the subgraphs of K) in time polynomial in $|N|$.

4.2 Learning undirected forests and spanning trees with degree bounds 

Although the problems of learning undirected forests and of learning undirected spanning trees are
solvable in polynomial time, learning an undirected forest/spanning tree with a given degree bound
$\deg_G(i) \le k < |N| - 1$, $\forall i \in N$, is NP-hard. For $k = 1$ this problem is equal to the well-known
problem of finding a maximum weight matching in K, which is in the general case polynomial time
solvable (see §30 in [19]).

Theorem 4.2 Given a node set N, an undirected graph $K = (N, \mathcal{E}(K))$ and an evaluation oracle
for computing Q(G,D). Moreover, let $k \in \mathbb{Z}_+$ be a constant with $2 \le k < |N| - 1$. Then the
following statements hold.

(a) The problem of finding a maximum score subgraph of K that is a forest and that fulfils the
degree bounds $\deg(i) \le k$, $\forall i \in N$, is NP-hard (in $|N|$) for any fixed (strictly) positive score
function $Q(\cdot, D)$.

(b) The problem of finding a maximum score spanning tree of K that fulfils the degree bounds
$\deg(i) \le k$, $\forall i \in N$, is NP-hard (in $|N|$) for any fixed score function $Q(\cdot, D)$.

Remark. Again, we have removed the explicit dependence on $\langle K \rangle$, since $\langle K \rangle \in O(|N|^2)$.

Proof. We deduce part (b) from the following feasibility problem. In §3.2.1 of [10], it has been
shown that the task:

Bounded degree spanning tree

Instance: An undirected graph K and a constant $2 \le k < |N| - 1$

Question: "Is there a spanning tree for K in which no node has degree exceeding k?"

is NP-complete by reduction from the Hamiltonian path problem.

Part (a) now follows by considering the subfamily of problems in which the linear objective takes
only (strictly) positive values and, thus, every optimal forest (with the bounded degree) is a spanning
tree. Hence, the problem of finding a maximum-weight forest (with a given degree bound)
is equivalent to finding a maximum-weight spanning tree (with a given degree bound). As the
feasibility problem for the latter is already NP-complete, part (a) follows. □

We wish to remark that Meek [16] shows a similar hardness result for learning paths, i.e. spanning 
trees with upper degree bound k = 2 for the maximum log-likelihood, the minimum description 
length and the Bayesian quality criteria. 

4.3 Learning chordal graphs 

Undirected chordal graph models are the intersection of DAG models and undirected graph models, 
known as Markov networks [21]. In this section, we show that learning these models is NP-hard.

Theorem 4.3 Given a node set N, an undirected graph $K = (N, \mathcal{E}(K))$ and an evaluation oracle
for computing Q(G, D). The problem of finding a maximum score chordal subgraph of K is NP-hard
(in $|N|$).

Proof. We show that we can polynomially transform the following NP-hard problem to learning
undirected chordal graphs.

Clique of given size

Instance: An undirected graph K and a constant $2 \le k < |N| - 1$
Question: "Is there a clique set in K of size at least k?"

We now define a suitable learning problem that would solve this problem. By Corollary 3.4 we
know that for any chordal graph G the entry $c_G(T)$ is 1 if and only if $T \subseteq N$, $|T| > 1$, is a clique
(otherwise this entry is 0). Thus, the score function value for G is determined by the values of
the linear objective function $w^\top x$ for the cliques T in G. In particular, we can define the values
for the cliques in such a way that when transforming the learning problem to the problem of
maximizing $w^\top x$ over the characteristic imset polytope (= the convex hull over all characteristic
imsets) the entries $w(T)$ are 0 when $|T| < k$ and are positive when $|T| \ge k$. This implies that the
maximum score among all chordal subgraphs of K is positive iff there exists a chordal subgraph
in K containing a clique T of size $|T| \ge k$. This happens iff K has a clique of size at least k. □
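
For concreteness, one admissible choice of the objective in this reduction can be written out as follows; the specific indicator weights are our illustrative assumption, and any weights that vanish for $|T| < k$ and are positive for $|T| \ge k$ would serve equally well.

```latex
\begin{align*}
w(T) &:= \begin{cases} 1 & \text{if } |T| \ge k, \\ 0 & \text{if } |T| < k, \end{cases}
  \qquad T \subseteq N,\ |T| > 1, \\[4pt]
w^\top c_G &= \sum_{T \subseteq N,\, |T| > 1} w(T)\, c_G(T)
  = \bigl|\{\, T \subseteq N : T \text{ is a clique in } G,\ |T| \ge k \,\}\bigr|
  \qquad \text{for every chordal subgraph } G \text{ of } K.
\end{align*}
```

If K contains a clique $T^*$ with $|T^*| \ge k$, then the graph whose only edges are the pairs inside $T^*$ is a chordal subgraph of K with $w^\top c_G \ge 1$. Conversely, $w^\top c_G > 0$ for a chordal subgraph G of K forces G, and hence K, to contain a clique of size at least k.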



4.4 Learning chordal graphs with bounded size of cliques 

Let us consider a variation of the previous task by introducing an upper bound k for the size of
cliques. If $k \le 2$, we get the problems of learning undirected forests/matchings, which we already
know are solvable in polynomial time (see Section 4.1 and Section 4.2).

For $k > 2$, the corresponding problem is NP-hard already for a fixed type of score function. This
has been shown by Srebro [20] for the maximized log-likelihood criterion (as a generalization of
the work by Chow and Liu [8]).

5 Conclusions



Let us summarize the main contributions of the paper. We introduced characteristic imsets as 
new simple representatives of Bayesian network structures, which are much closer to the graphical 
description. Actually, there is an easy transformation from the characteristic imset into the (es- 
sential) graph. Last but not least, the insight brought by the use of characteristic imsets makes it 
possible to offer elegant combinatorial proofs of (known and new) complexity results. The proofs 
avoid special assumptions on the form of the quality criterion besides the standard assumptions of 
score equivalence and decomposability. 

In our future work, we plan to apply these tools in the linear programming approach to learning. 
For this purpose we would like to find a general linear (facet-) description of the corresponding 
characteristic imset polytope or, at least, of a suitable polyhedral relaxation containing exactly the
characteristic imsets as lattice points. Finding suitable polyhedral descriptions is also interesting 
and important for learning restricted families of Bayesian network structures, for example, for 
learning undirected chordal graphs. 

Finally, let us remark that a polyhedral approach to learning Bayesian network structures (using 
integer programming techniques) has also been suggested by Jaakkola et al. [12], but their way
of representing DAGs is different from ours. Their representatives live in dimension $|N| \cdot 2^{|N|-1}$
and correspond to individual DAGs, while ours live in dimension $2^{|N|} - |N| - 1$ and correspond to
Markov equivalence classes (of DAGs).